Python ships with a built-in regular expression module, re.
For strings that follow a known textual pattern (such as dates, phone numbers, or ID numbers), you can design a regex rule to extract the key information from the text.
A regex rule abstracts each character of the text into a more general description, so that the rule captures the general form of the text rather than one specific string.
At its simplest, to find the term "python" in a string, you can write:
import re
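A minimal sketch of such a lookup follows. The sample string and the choice of re.search are assumptions for illustration; the original listing does not show them.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "I am learning python and regex"

# re.search scans the whole string and returns the first match, or None.
match = re.search("python", text)
found = match.group(0) if match else None
print(found)
```

A pattern made only of ordinary characters matches that exact text, so this behaves like a substring search.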
Regular expressions have built-in special characters to match a broader range of terms, such as:
\d matches any digit.
. matches any single character (except a newline, by default).
* repeats the previous regex element any number of times, including zero.
regex_expr = r".*\d\d\d\d"
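To see these three special characters work together, here is a small sketch that applies the pattern to a made-up input. The sample string and the use of re.match are assumptions, not taken from the original.

```python
import re

# Hypothetical sample text; the source does not show its input string.
line = "Released in 2008"

# .* matches any run of characters, then \d\d\d\d requires four digits.
regex_expr = r".*\d\d\d\d"

# re.match anchors the pattern at the start of the string.
matched = re.match(regex_expr, line)
result = matched.group(0) if matched else None
print(result)
```

The pattern is written as a raw string (r"...") so that Python passes the backslashes to the regex engine unchanged.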
The other special characters are not covered here. If you forget them, refer to: https://www.runoob.com/regexp/regexp-syntax.html
Sometimes several consecutive characters share the same regex expression. To shorten the rule, you can use {}.
In the example above, we matched a 4-digit year with \d\d\d\d. With curly brackets, it can be simplified to:
\d{4}
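The equivalence of the two spellings can be checked with a short sketch. The sample string is hypothetical, and re.findall is used here only as a convenient way to list every match.

```python
import re

# Hypothetical sample text, chosen only for illustration.
line = "Released in 2008"

short_form = r"\d{4}"      # four digits, written with a repetition count
long_form = r"\d\d\d\d"    # the same four digits, spelled out

# re.findall returns every non-overlapping match as a list of strings.
years_short = re.findall(short_form, line)
years_long = re.findall(long_form, line)
print(years_short, years_long)
```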
Sometimes a single position can hold one of several characters.
For example, if a certain character might be "a", "b", or "c", you can use [] to express that compactly:
[abc]
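As an illustration, the sketch below uses a character set as the first letter of a short word. The word list is made up for the example.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "cat bat rat mat"

# [abc] accepts exactly one character, which must be a, b, or c.
words = re.findall(r"[abc]at", text)
print(words)
```

Only "cat" and "bat" begin with a character from the set, so "rat" and "mat" are skipped.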
[] also supports some built-in shorthand:
[0-9] is equivalent to \d for ASCII digits.
[a-z] matches any lowercase letter.
[.*?] Most punctuation inside square brackets (such as ., *, and ?) loses its special meaning, but ^ as the first character inside the brackets indicates negation.

Regex matches from left to right, and each quantifier is greedy by default: it tries to match as many characters as possible for its segment of the rule. For example:
line = "qeileeeeeeenn"
It will match eeen, because qeileeee was greedily matched by .*.
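The greedy behaviour can be reproduced with the sketch below. The pattern is an assumption (the non-greedy pattern from this section with its ? removed), as is reading the result from capture group 1; the original listing does not show either.

```python
import re

line = "qeileeeeeeenn"

# Assumed pattern: a greedy .* followed by a capture group.
regex_expr = r".*(e.+en).*"

match_obj = re.match(regex_expr, line)
# Group 1 holds whatever the parenthesised part of the pattern matched.
greedy_group = match_obj.group(1) if match_obj else None
print(greedy_group)
```

The leading .* takes as much as it can and gives back only enough characters for the group to still match, which leaves the group with the shortest possible tail.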
You can append ? to a quantifier to make it non-greedy, like:
regex_expr = ".*?(e.+en).*"
This will match eileeeeeeen, because .*? now consumes as little as possible (only the leading q).
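A short sketch of the non-greedy variant, again assuming re.match and capture group 1, neither of which is shown in the original listing:

```python
import re

line = "qeileeeeeeenn"

# The ? after .* makes the leading quantifier lazy (non-greedy).
regex_expr = r".*?(e.+en).*"

match_obj = re.match(regex_expr, line)
# Group 1 now starts at the first "e", because .*? stops as early as it can.
lazy_group = match_obj.group(1) if match_obj else None
print(lazy_group)
```

Note that the .+ inside the group is still greedy, so the group extends to the last "en" in the string.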
import re